Principal Component Analysis (PCA) is widely used in data exploration, dimension reduction, and data visualization. The aim is to transform the original variables into uncorrelated linear combinations while retaining as much of the information in the data as possible. High-dimensional data often reveals clusters when viewed in a lower-dimensional space.
Clustering analysis is another form of EDA. Here we hope to group data points so that points are close to each other within a group and far apart between groups. Clustering on PCs can be effective. Clustering analysis can be quite subjective in how we summarize the properties within each group.
Both PCA and clustering analysis are so-called unsupervised learning: no response variable is involved in the process.
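As a minimal self-contained illustration of both ideas, the sketch below runs PCA with `prcomp` and then k-means on the leading PC scores. It uses the built-in `iris` measurements purely as example data; the species labels are never used, which is what makes the procedure unsupervised.

```r
x  <- scale(iris[, 1:4])   # center and scale the four measurements
pc <- prcomp(x)            # uncorrelated linear combinations of the columns
scores <- pc$x[, 1:2]      # low-dimensional view of the data

set.seed(1)
km <- kmeans(scores, centers = 3, nstart = 20)  # compact, well-separated groups
table(km$cluster)                               # cluster sizes
```

With real data the number of centers would itself be a modeling choice; here 3 simply matches the number of species we know the data contains.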
For supervised learning, we try to find out how a set of predictors relates to some response variable of interest. Multiple regression is still, by far, one of the most popular methods. We use a linear model as a working model for its simplicity and interpretability. On the other hand, it is important to use domain knowledge as much as we can to determine the form of the response as well as the functional form of the predictors.
Self-esteem generally describes a person’s overall sense of self-worth and personal value. It can play a significant role in one’s motivation and success throughout life. Factors that influence self-esteem include inner thinking, health condition, age, life experiences, etc. We will try to identify possible factors in our data that are related to the level of self-esteem.
The well-cited National Longitudinal Survey of Youth (NLSY79) follows about 13,000 individuals, and a wealth of individual-year information has been gathered through surveys. The survey data is open to the public. Among the many variables, we assembled a subset including personal demographic variables in different years, household environment in 1979, ASVAB test scores in 1981, and self-esteem scores in 1981 and 1987.
The data is stored in NLSY79.csv.
Here are the descriptions of the variables:
Personal Demographic Variables
Household Environment
Variables Related to ASVAB test Scores in 1981
| Test | Description |
|---|---|
| AFQT | percentile score on the AFQT intelligence test in 1981 |
| Coding | score on the Coding Speed test in 1981 |
| Auto | score on the Automotive and Shop test in 1981 |
| Mechanic | score on the Mechanic test in 1981 |
| Elec | score on the Electronics Information test in 1981 |
| Science | score on the General Science test in 1981 |
| Math | score on the Math test in 1981 |
| Arith | score on the Arithmetic Reasoning test in 1981 |
| Word | score on the Word Knowledge Test in 1981 |
| Parag | score on the Paragraph Comprehension test in 1981 |
| Numer | score on the Numerical Operations test in 1981 |
Self-Esteem tests in 1981 and 1987
We have two sets of self-esteem tests, one in 1981 and the other in 1987. Each set has the same 10 questions, labeled Esteem81 and Esteem87 respectively, followed by the question number. For example, Esteem81_1 is Esteem question 1 in 1981.
The following 10 questions are answered as 1: strongly agree, 2: agree, 3: disagree, 4: strongly disagree.
Load the data. Do a quick EDA to get familiar with the data set. Pay attention to the unit of each variable. Are there any missing values?
Observations:
Some of the income values are negative, which doesn’t seem right.
Many Job05 values are missing.
Some of the HeightFeet05 values are negative, which doesn’t make sense.
A number of test score values are 0, which presumably means those people never took the exams.
None of the esteem columns appear to have missing entries.
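These checks can be expressed in code. Below is a sketch run on a tiny hypothetical stand-in data frame (`nlsy_toy`, invented here) with the same column names, since the real file is only loaded later in the report:

```r
# Hypothetical stand-in with a few NLSY79 columns (illustration only)
nlsy_toy <- data.frame(
  Income87     = c(25000, -3, 0, 41000),
  Job05        = c("Teachers", NA, NA, "Sales"),
  HeightFeet05 = c(5, 6, -4, 5),
  AFQT         = c(72, 0, 55, 0),
  Esteem87_1   = c(1, 2, 1, 4)
)

n_neg_income  <- sum(nlsy_toy$Income87 < 0, na.rm = TRUE)      # negative incomes
n_missing_job <- sum(is.na(nlsy_toy$Job05))                    # missing job codes
n_neg_height  <- sum(nlsy_toy$HeightFeet05 < 0, na.rm = TRUE)  # impossible heights
n_zero_score  <- sum(nlsy_toy$AFQT == 0, na.rm = TRUE)         # 0 scores: test never taken?
n_na_esteem   <- sum(is.na(nlsy_toy$Esteem87_1))               # esteem columns: no NAs
```

Run on the real data frame, the same one-liners reproduce each of the observations above.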
Let us concentrate on the Esteem scores evaluated in 1987.
EDA the Esteem variables. Pay attention to missing values, peculiar numbers, etc. How do you fix any problems discovered? Briefly describe what you have done for the data preparation.
Observations:
There are no missing values.
The midpoint of the possible responses (1-4) is 2.5. However, every question skews either toward agree (a mean near 1.5) or toward disagree (a mean near 3.5).
One possible issue is that an “agree” response can indicate either low or high self-esteem, depending on the question asked. For example, a person with strong self-esteem would respond with a low number (1 or 2) to a question like Question 1, “I am a person of worth,” but with a high number (3 or 4) to a question like Question 5, “I do not have much to be proud of.” We address this issue in the following step.
temp <- read.csv('data/NLSY79.csv', header = T, stringsAsFactors = F)

summary(temp$Esteem87_1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    1.00    1.38    2.00    4.00
summary(temp$Esteem87_2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0     1.0     1.0     1.4     2.0     4.0
summary(temp$Esteem87_3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    3.00    4.00    3.58    4.00    4.00
summary(temp$Esteem87_4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0     1.0     1.0     1.5     2.0     4.0
summary(temp$Esteem87_5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    3.00    4.00    3.53    4.00    4.00
summary(temp$Esteem87_6)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    2.00    1.59    2.00    4.00
summary(temp$Esteem87_7)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    2.00    1.72    2.00    4.00
summary(temp$Esteem87_8)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0     3.0     3.0     3.1     4.0     4.0
summary(temp$Esteem87_9)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    3.00    3.00    3.06    4.00    4.00
summary(temp$Esteem87_10)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    3.00    3.00    3.37    4.00    4.00
(Store the scores in data.esteem, then
data.esteem[, c(1, 2, 4, 6, 7)] <- 5 - data.esteem[, c(1, 2, 4, 6, 7)]
to reverse the score.)

data.esteem <- data.frame(temp$Esteem87_1, temp$Esteem87_2, temp$Esteem87_3,
                          temp$Esteem87_4, temp$Esteem87_5, temp$Esteem87_6,
                          temp$Esteem87_7, temp$Esteem87_8, temp$Esteem87_9,
                          temp$Esteem87_10)
colnames(data.esteem) <- paste0("q", 1:10)
data.esteem[, c(1, 2, 4, 6, 7)] <- 5 - data.esteem[, c(1, 2, 4, 6, 7)]

Plotting the respondents’ responses as a stacked bar graph, we see that most respondents have high self-esteem: very few responses are 1s (fewer than 500) compared with 4s (more than 10,000).
Question 1 has the highest mean (3.62), while Question 9 has the lowest (3.06).
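These per-question means can be read off with `colMeans`; a sketch on a hypothetical stand-in matrix (`esteem_toy`, invented here so the snippet is self-contained; in the report it would be applied to the reversed scores directly):

```r
set.seed(1)
# Hypothetical stand-in for the reversed 1-4 scores (illustration only)
esteem_toy <- as.data.frame(matrix(sample(1:4, 200, replace = TRUE,
                                          prob = c(0.05, 0.10, 0.30, 0.55)),
                                   ncol = 10,
                                   dimnames = list(NULL, paste0("q", 1:10))))

means <- colMeans(esteem_toy)   # mean response per question
names(which.max(means))         # question with the highest mean
names(which.min(means))         # question with the lowest mean
```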
Summaries of the reversed scores, q1 through q10 in order:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 4.00 3.62 4.00 4.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 3.0 4.0 3.6 4.0 4.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 4.00 3.58 4.00 4.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 3.0 4.0 3.5 4.0 4.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 4.00 3.53 4.00 4.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 3.00 3.41 4.00 4.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 3.00 3.28 4.00 4.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 3.0 3.0 3.1 4.0 4.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 3.00 3.06 4.00 4.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 3.00 3.37 4.00 4.00
There is a positive correlation between all esteem scores, ranging from 0.24 (q1 and q9) to 0.7 (q1 and q2).
res <- cor(data.esteem)
round(res, 2)

pc.10 <- prcomp(data.esteem, scale. = TRUE)
names(pc.10)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
a) Report the PC1 and PC2 loadings. Are they unit vectors? Are they orthogonal?
Yes: the two loadings are orthogonal, and each is a unit vector.
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
|---|---|---|---|---|---|---|---|---|---|---|
| q1 | 0.324 | -0.445 | 0.107 | -0.216 | 0.237 | -0.323 | 0.012 | -0.081 | -0.011 | 0.687 |
| q2 | 0.333 | -0.428 | 0.055 | -0.234 | 0.199 | -0.277 | 0.063 | -0.039 | 0.080 | -0.721 |
| q3 | 0.322 | 0.011 | 0.513 | 0.245 | -0.172 | 0.007 | -0.621 | 0.395 | -0.033 | -0.028 |
| q4 | 0.324 | -0.288 | -0.109 | -0.211 | -0.093 | 0.827 | 0.126 | 0.174 | 0.120 | 0.052 |
| q5 | 0.315 | 0.079 | 0.451 | 0.471 | -0.125 | 0.019 | 0.645 | -0.184 | -0.059 | -0.002 |
| q6 | 0.347 | -0.049 | -0.419 | 0.181 | -0.201 | 0.014 | -0.195 | -0.306 | -0.703 | -0.042 |
| q7 | 0.315 | 0.020 | -0.550 | 0.335 | -0.275 | -0.278 | 0.064 | 0.279 | 0.498 | 0.059 |
| q8 | 0.280 | 0.362 | -0.140 | 0.236 | 0.824 | 0.157 | -0.096 | 0.004 | 0.050 | -0.009 |
| q9 | 0.277 | 0.492 | -0.002 | -0.516 | -0.081 | -0.177 | 0.278 | 0.464 | -0.290 | 0.015 |
| q10 | 0.318 | 0.392 | 0.102 | -0.323 | -0.222 | 0.026 | -0.225 | -0.620 | 0.381 | 0.012 |
## q1 q2 q3 q4 q5 q6 q7 q8 q9 q10
## 0.324 0.333 0.322 0.324 0.315 0.347 0.315 0.280 0.277 0.318
## q1 q2 q3 q4 q5 q6 q7 q8 q9 q10
## -0.4452 -0.4283 0.0115 -0.2877 0.0793 -0.0492 0.0196 0.3619 0.4917 0.3918
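Orthonormality can be checked numerically: for any `prcomp` rotation matrix L, t(L) %*% L is the identity, so each column is a unit vector and distinct columns are orthogonal. A self-contained sketch on synthetic data (the same check applies verbatim to pc.10$rotation):

```r
set.seed(2)
x <- matrix(rnorm(300), ncol = 10)        # any 10-variable data set
L <- prcomp(x, scale. = TRUE)$rotation    # 10 x 10 loading matrix

gram <- t(L) %*% L                        # Gram matrix of the loadings
max(abs(gram - diag(10)))                 # numerically zero: orthonormal
```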
b) Are there good interpretations for PC1 and PC2? (If loadings are all negative, take the positive loadings for the ease of interpretation)
PC1 interpretation: all of the loadings are positive and around 0.3, so PC1 is roughly proportional to the sum of the 10 (reversed) responses. A subject with a higher PC1 score tends to have uniformly higher responses across all 10 questions, i.e., higher self-esteem overall, which makes sense as a single summary of the test.
PC2 interpretation: PC2 contrasts questions 8, 9, and 10 (positive loadings) with questions 1, 2, and 4 (negative loadings), while questions 3, 5, 6, and 7 contribute little. It therefore separates subjects who score relatively higher on the last three questions from those who score relatively higher on questions 1, 2, and 4.
PC1 is the linear combination of the 10 scores with the largest variance; equivalently, its direction minimizes the total squared perpendicular distance from the data points.
For each subject, the PC1 score is the loading-weighted sum of their standardized responses: 0.324 x (q1 response) + 0.333 x (q2 response) + 0.322 x (q3 response) + 0.324 x (q4 response) + 0.315 x (q5 response) + 0.347 x (q6 response) + 0.315 x (q7 response) + 0.280 x (q8 response) + 0.277 x (q9 response) + 0.318 x (q10 response).
Yes, the PC1 and PC2 scores are uncorrelated, and the PC1 scores have greater variance than the PC2 scores.
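Both facts hold by construction for `prcomp` scores; a quick self-contained check on synthetic data:

```r
set.seed(3)
x  <- matrix(rnorm(500), ncol = 5)
pc <- prcomp(x, scale. = TRUE)

v <- apply(pc$x, 2, var)   # variances of the PC scores, largest first
round(cor(pc$x), 3)        # off-diagonal correlations are numerically zero
all(diff(v) <= 0)          # variances are non-increasing across PCs
```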
From a scree plot of the variances of each PC, we see that a large portion of the total variance is explained by the leading principal component (0.469 of the total). The variances decrease across the PCs, from 0.469 of the total for PC1 down to 0.0294 for PC10.
summary(pc.10)$importance

##                          PC1   PC2   PC3    PC4    PC5    PC6    PC7    PC8
## Standard deviation     2.166 1.102 0.911 0.8090 0.7699 0.7037 0.6714 0.6346
## Proportion of Variance 0.469 0.121 0.083 0.0654 0.0593 0.0495 0.0451 0.0403
## Cumulative Proportion  0.469 0.590 0.673 0.7388 0.7981 0.8477 0.8927 0.9330
##                           PC9   PC10
## Standard deviation     0.6133 0.5421
## Proportion of Variance 0.0376 0.0294
## Cumulative Proportion  0.9706 1.0000
plot(pc.10)  # variances of each PC

plot(summary(pc.10)$importance[2, ],  # scree plot of PVEs
     ylab = "PVE",
     xlab = "Number of PCs",
     pch = 16,
     main = "Scree plot of PVE")

f) Also plot CPVE (Cumulative Proportion of Variance Explained). What proportion of the variance in the data is explained by the first two principal components?
Plotting the CPVE, we see that 0.59 of the total variance in the data is explained by the first two principal components.
plot(summary(pc.10)$importance[3, ],
     ylab = "Cumulative PVE",
     xlab = "Number of PCs",
     pch = 16,
     main = "Scree plot of Cumulative PVE")

g) PCs provide us with a low-dimensional view of the self-esteem scores. Use a biplot with the first two PCs to display the data. Give an interpretation of PC1 and PC2 from the plot. (Try `ggbiplot` if you can, much prettier!)
A biplot of PC1 and PC2 indicates that a) the PC1 loadings are similar in magnitude and share the same sign, b) PC2 captures the difference between the total of questions 8, 9, and 10 and the total of questions 1, 2, and 4, with questions 3, 5, 6, and 7 having little effect, and c) questions 8, 9, and 10 are highly correlated, as are questions 1, 2, and 4.
limx <- c(0, 0.03)
limy <- c(-0.05, 0.05)
biplot(pc.10,
       xlim = limx,
       ylim = limy,
       main = "Biplot of PC1 and PC2")

Apply k-means to cluster subjects on the original esteem scores.
Based on the elbow method, as shown by the factoextra::fviz_nbclust call below, we use k = 2, i.e., two clusters.
set.seed(0)
factoextra::fviz_nbclust(data.esteem[, -1], kmeans, method = "wss")

b) Can you summarize common features within each cluster?
* Participants within each cluster will have had similar responses to the 10 questions as others in their cluster.
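The clustering itself is a single `kmeans` call, and the fitted centers are one concrete way to summarize each cluster. A sketch on a hypothetical stand-in for the esteem scores (`esteem_toy`, invented here; `nstart` > 1 guards against poor local optima):

```r
set.seed(0)
# Hypothetical stand-in for the reversed esteem scores (illustration only)
esteem_toy <- as.data.frame(matrix(sample(1:4, 300, replace = TRUE),
                                   ncol = 10,
                                   dimnames = list(NULL, paste0("q", 1:10))))

km <- kmeans(esteem_toy, centers = 2, nstart = 20)
table(km$cluster)        # cluster sizes
round(km$centers, 2)     # mean response per question within each cluster
```

With the real data, comparing the two rows of `km$centers` would show whether one cluster is uniformly higher across all 10 questions, i.e., a high- versus low-self-esteem split.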
c) Can you visualize the clusters with somewhat clear boundaries? You may try different pairs of variables and different PC pairs of the esteem scores.
as.data.frame(pc.10$x) %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point() +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  ggtitle("PC2 vs. PC1") +
  theme_bw()

pairs(pc.10$x, xlim = c(-4, 4), ylim = c(-4, 4), col = rainbow(6), pch = 16)

We now try to find out what factors are related to self-esteem. PC1 of all the esteem scores is a good variable to summarize one’s esteem scores, so we take it as our response variable.
EDA the data set first.
Personal information: gender, education (05), log(income) in 87, job type in 87, Weight05 (lb), and HeightFeet05 together with HeightInch05. One way to summarize weight and height is the Body Mass Index (BMI), defined as body mass divided by the square of body height, universally expressed in kg/m². Note that you need to create BMI first; then you may include it as one possible predictor.
Household environment: Imagazine, Inewspaper, Ilibrary, MotherEd,
FatherEd, FamilyIncome78. Do set indicators Imagazine,
Inewspaper and Ilibrary as factors.
You may use PC1 of the ASVAB scores as the level of intelligence.
# EDA
esteemFactors <- data.frame(temp$Gender, temp$Education05, temp$Income87,
                            temp$Job05, temp$Weight05, temp$HeightFeet05,
                            temp$HeightInch05, factor(temp$Imagazine),
                            factor(temp$Inewspaper), factor(temp$Ilibrary),
                            temp$MotherEd, temp$FatherEd, temp$FamilyIncome78)

# total height in metres, then BMI in kg/m^2
esteemFactors$totalHeightMeters5 <- ((esteemFactors$temp.HeightFeet05 * 12) +
                                     esteemFactors$temp.HeightInch05) * 0.0254
esteemFactors$BMI05 <- (esteemFactors$temp.Weight05 * 0.453592) /
                       (esteemFactors$totalHeightMeters5^2)
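A quick sanity check on the unit conversions (1 lb = 0.453592 kg, 1 in = 0.0254 m), wrapped in a small helper function (the function name is ours, for illustration):

```r
bmi_kg_m2 <- function(weight_lb, feet, inches) {
  height_m <- (feet * 12 + inches) * 0.0254   # total height in metres
  (weight_lb * 0.453592) / height_m^2         # mass in kg over height squared
}

round(bmi_kg_m2(150, 5, 10), 1)   # 21.5: a 150 lb, 5'10" person, normal range
```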
pc.10.loading <- pc.10$rotation
pc.10.loading[, 1]
##    q1    q2    q3    q4    q5    q6    q7    q8    q9   q10
## 0.324 0.333 0.322 0.324 0.315 0.347 0.315 0.280 0.277 0.318

esteemFactors$PC1 <- data.frame(
  data.esteem$q1 * 0.324 + data.esteem$q2 * 0.333 + data.esteem$q3 * 0.322 +
  data.esteem$q4 * 0.324 + data.esteem$q5 * 0.315 + data.esteem$q6 * 0.347 +
  data.esteem$q7 * 0.315 + data.esteem$q8 * 0.280 + data.esteem$q9 * 0.277 +
  data.esteem$q10 * 0.318)
colnames(esteemFactors)[1:13] <- c("gender", "Education05", "Income87", "Job05",
                                   "Weight05", "HeightFeet05", "HeightInch05",
                                   "Imagazine", "Inewspaper", "Ilibrary",
                                   "MotherEd", "FatherEd", "FamilyIncome78")
colnames(esteemFactors)[16] <- "PC1"
data.ASVAB <- data.frame(temp$Science, temp$Arith, temp$Word, temp$Parag,
                         temp$Number, temp$Coding, temp$Auto, temp$Math,
                         temp$Mechanic, temp$Elec, temp$AFQT)
pc.ASVAB <- prcomp(data.ASVAB, scale = TRUE)
pc.ASVAB.loading <- pc.ASVAB$rotation
knitr::kable(pc.ASVAB.loading)

| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| temp.Science | 0.328 | -0.159 | 0.089 | -0.236 | 0.011 | -0.464 | -0.476 | -0.205 | -0.496 | -0.270 | -0.007 |
| temp.Arith | 0.336 | 0.037 | 0.162 | 0.413 | 0.030 | 0.019 | 0.052 | 0.240 | 0.262 | -0.702 | -0.258 |
| temp.Word | 0.330 | 0.042 | 0.153 | -0.471 | -0.001 | -0.151 | -0.198 | -0.070 | 0.705 | 0.209 | -0.191 |
| temp.Parag | 0.310 | 0.174 | 0.225 | -0.465 | -0.050 | 0.589 | 0.224 | 0.097 | -0.368 | -0.048 | -0.243 |
| temp.Number | 0.255 | 0.430 | -0.377 | 0.144 | -0.731 | 0.006 | -0.051 | -0.219 | 0.000 | 0.020 | 0.033 |
| temp.Coding | 0.217 | 0.491 | -0.561 | -0.081 | 0.603 | -0.099 | 0.038 | 0.112 | -0.043 | -0.017 | 0.048 |
| temp.Auto | 0.247 | -0.487 | -0.413 | -0.015 | -0.154 | 0.158 | -0.241 | 0.629 | -0.014 | 0.162 | 0.043 |
| temp.Math | 0.322 | 0.133 | 0.299 | 0.468 | 0.090 | -0.189 | 0.014 | 0.124 | -0.203 | 0.596 | -0.339 |
| temp.Mechanic | 0.291 | -0.349 | -0.188 | 0.265 | 0.247 | 0.456 | -0.138 | -0.623 | 0.077 | 0.067 | -0.011 |
| temp.Elec | 0.297 | -0.355 | -0.154 | -0.112 | -0.065 | -0.359 | 0.774 | -0.127 | -0.022 | 0.001 | 0.036 |
| temp.AFQT | 0.353 | 0.114 | 0.341 | 0.079 | 0.040 | 0.080 | 0.014 | 0.100 | 0.053 | 0.049 | 0.847 |
pc.ASVAB.loading[, 1]
##  temp.Science    temp.Arith     temp.Word    temp.Parag   temp.Number
##         0.328         0.336         0.330         0.310         0.255
##   temp.Coding     temp.Auto     temp.Math temp.Mechanic     temp.Elec
##         0.217         0.247         0.322         0.291         0.297
##     temp.AFQT
##         0.353
esteemFactors$PC1ASVAB <- data.frame(
  temp$Science * 0.328 + temp$Arith * 0.336 + temp$Word * 0.330 +
  temp$Parag * 0.310 + temp$Number * 0.255 + temp$Coding * 0.217 +
  temp$Auto * 0.247 + temp$Math * 0.322 + temp$Mechanic * 0.291 +
  temp$Elec * 0.297 + temp$AFQT * 0.353)

b) Run a few regression models between PC1 of all the esteem scores and suitable variables listed in a). Find a final best model with your own criterion.
- How did you land this model? Run a model diagnosis to see if the linear model assumptions are reasonably met.
- Write a summary of your findings. In particular, explain what and how the variables in the model affect one's self-esteem.
We chose the model with all variables. Running a model diagnosis, we see that both the residuals-vs-fitted and normal Q-Q plots look fine.
fit1 <- lm(unlist(PC1) ~ unlist(PC1ASVAB), data=esteemFactors)
fit2 <- lm(unlist(PC1) ~ Education05, data=esteemFactors)
fit3 <- lm(unlist(PC1) ~ FamilyIncome78, data=esteemFactors)
fit.all <- lm(unlist(PC1) ~ unlist(PC1ASVAB) + gender + Education05 + Income87 + Job05 + Weight05 + HeightFeet05 + HeightInch05 + Imagazine + Inewspaper + Ilibrary + MotherEd + FatherEd + FamilyIncome78 + totalHeightMeters5 + BMI05 + unlist(PC1ASVAB), data=esteemFactors)
summary(fit1)

##
## Call:
## lm(formula = unlist(PC1) ~ unlist(PC1ASVAB), data = esteemFactors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.881 -0.966 0.007 1.046 2.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.52753 0.08392 113.5 <2e-16 ***
## unlist(PC1ASVAB) 0.01590 0.00103 15.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.21 on 2429 degrees of freedom
## Multiple R-squared: 0.0899, Adjusted R-squared: 0.0896
## F-statistic: 240 on 1 and 2429 DF, p-value: <2e-16
summary(fit2)

##
## Call:
## lm(formula = unlist(PC1) ~ Education05, data = esteemFactors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.481 -1.052 -0.022 1.017 2.502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.92039 0.14069 63.4 <2e-16 ***
## Education05 0.13301 0.00995 13.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.23 on 2429 degrees of freedom
## Multiple R-squared: 0.0685, Adjusted R-squared: 0.0681
## F-statistic: 179 on 1 and 2429 DF, p-value: <2e-16
summary(fit.all)

##
## Call:
## lm(formula = unlist(PC1) ~ unlist(PC1ASVAB) + gender + Education05 +
## Income87 + Job05 + Weight05 + HeightFeet05 + HeightInch05 +
## Imagazine + Inewspaper + Ilibrary + MotherEd + FatherEd +
## FamilyIncome78 + totalHeightMeters5 + BMI05 + unlist(PC1ASVAB),
## data = esteemFactors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.949 -0.909 0.000 0.972 3.025
##
## Coefficients: (1 not defined because of singularities)
## Estimate
## (Intercept) 8.84e+00
## unlist(PC1ASVAB) 7.58e-03
## gendermale 1.20e-01
## Education05 5.01e-02
## Income87 9.94e-06
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 3.58e-01
## Job051000 TO 1240: Mathematical and Computer Scientists 4.06e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians -7.48e-02
## Job051600 TO 1760: Physical Scientists -9.26e-01
## Job051800 TO 1860: Social Scientists and Related Workers -3.05e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians 1.48e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 1.47e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 2.09e-01
## Job052200 TO 2340: Teachers 2.12e-01
## Job052400 TO 2550: Education, Training and Library Workers 1.98e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 6.99e-01
## Job052800 TO 2960: Media and Communications Workers 2.44e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 4.02e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations -1.22e-01
## Job053700 TO 3950: Protective Service Occupations 6.17e-01
## Job054000 TO 4160: Food Preparation and Serving Related Occupations -1.37e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations -1.48e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers -6.98e-01
## Job054500 TO 4650: Personal Care and Service Workers 2.55e-01
## Job054700 TO 4960: Sales and Related Workers 2.41e-01
## Job05500 TO 950: Management Related Occupations 4.82e-01
## Job055000 TO 5930: Office and Administrative Support Workers 2.71e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations -7.88e-02
## Job056200 TO 6940: Construction Trade and Extraction Workers 5.38e-03
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 1.11e-01
## Job057700 TO 7750: Production and Operating Workers 1.42e-01
## Job057800 TO 7850: Food Preparation Occupations 1.31e-01
## Job057900 TO 8960: Setters, Operators and Tenders 1.38e-01
## Job059000 TO 9750: Transportation and Material Moving Workers -1.08e-01
## Job059990: Uncodeable -8.03e-02
## Weight05 -1.34e-04
## HeightFeet05 5.68e-03
## HeightInch05 -2.11e-03
## Imagazine1 -1.02e-02
## Inewspaper1 1.60e-01
## Ilibrary1 8.20e-02
## MotherEd 1.24e-02
## FatherEd -1.71e-04
## FamilyIncome78 8.07e-07
## totalHeightMeters5 NA
## BMI05 -3.50e-03
## Std. Error
## (Intercept) 5.56e-01
## unlist(PC1ASVAB) 1.40e-03
## gendermale 7.00e-02
## Education05 1.37e-02
## Income87 2.27e-06
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 1.72e-01
## Job051000 TO 1240: Mathematical and Computer Scientists 2.21e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians 2.30e-01
## Job051600 TO 1760: Physical Scientists 6.18e-01
## Job051800 TO 1860: Social Scientists and Related Workers 5.12e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians 4.78e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 2.47e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 3.49e-01
## Job052200 TO 2340: Teachers 1.98e-01
## Job052400 TO 2550: Education, Training and Library Workers 2.73e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 2.92e-01
## Job052800 TO 2960: Media and Communications Workers 3.67e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 2.14e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations 2.00e-01
## Job053700 TO 3950: Protective Service Occupations 2.28e-01
## Job054000 TO 4160: Food Preparation and Serving Related Occupations 2.16e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations 2.17e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers 4.10e-01
## Job054500 TO 4650: Personal Care and Service Workers 2.43e-01
## Job054700 TO 4960: Sales and Related Workers 1.80e-01
## Job05500 TO 950: Management Related Occupations 1.97e-01
## Job055000 TO 5930: Office and Administrative Support Workers 1.72e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations 4.31e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers 1.93e-01
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 1.99e-01
## Job057700 TO 7750: Production and Operating Workers 2.34e-01
## Job057800 TO 7850: Food Preparation Occupations 6.16e-01
## Job057900 TO 8960: Setters, Operators and Tenders 1.97e-01
## Job059000 TO 9750: Transportation and Material Moving Workers 1.96e-01
## Job059990: Uncodeable 1.20e+00
## Weight05 1.60e-03
## HeightFeet05 8.96e-02
## HeightInch05 9.95e-03
## Imagazine1 6.01e-02
## Inewspaper1 7.75e-02
## Ilibrary1 6.12e-02
## MotherEd 1.25e-02
## FatherEd 9.29e-03
## FamilyIncome78 1.95e-06
## totalHeightMeters5 NA
## BMI05 9.53e-03
## t value
## (Intercept) 15.90
## unlist(PC1ASVAB) 5.40
## gendermale 1.72
## Education05 3.66
## Income87 4.38
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 2.08
## Job051000 TO 1240: Mathematical and Computer Scientists 1.84
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians -0.32
## Job051600 TO 1760: Physical Scientists -1.50
## Job051800 TO 1860: Social Scientists and Related Workers -0.59
## Job051900 TO 1960: Life, Physical and Social Science Technicians 0.31
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 0.60
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 0.60
## Job052200 TO 2340: Teachers 1.07
## Job052400 TO 2550: Education, Training and Library Workers 0.72
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 2.40
## Job052800 TO 2960: Media and Communications Workers 0.66
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 1.87
## Job053300 TO 3650: Health Care Technical and Support Occupations -0.61
## Job053700 TO 3950: Protective Service Occupations 2.71
## Job054000 TO 4160: Food Preparation and Serving Related Occupations -0.64
## Job054200 TO 4250: Cleaning and Building Service Occupations -0.68
## Job054300 TO 4430: Entertainment Attendants and Related Workers -1.70
## Job054500 TO 4650: Personal Care and Service Workers 1.05
## Job054700 TO 4960: Sales and Related Workers 1.34
## Job05500 TO 950: Management Related Occupations 2.44
## Job055000 TO 5930: Office and Administrative Support Workers 1.57
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations -0.18
## Job056200 TO 6940: Construction Trade and Extraction Workers 0.03
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 0.56
## Job057700 TO 7750: Production and Operating Workers 0.61
## Job057800 TO 7850: Food Preparation Occupations 0.21
## Job057900 TO 8960: Setters, Operators and Tenders 0.70
## Job059000 TO 9750: Transportation and Material Moving Workers -0.55
## Job059990: Uncodeable -0.07
## Weight05 -0.08
## HeightFeet05 0.06
## HeightInch05 -0.21
## Imagazine1 -0.17
## Inewspaper1 2.06
## Ilibrary1 1.34
## MotherEd 0.99
## FatherEd -0.02
## FamilyIncome78 0.41
## totalHeightMeters5 NA
## BMI05 -0.37
## Pr(>|t|)
## (Intercept) < 2e-16
## unlist(PC1ASVAB) 7.2e-08
## gendermale 0.08608
## Education05 0.00026
## Income87 1.2e-05
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 0.03742
## Job051000 TO 1240: Mathematical and Computer Scientists 0.06591
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians 0.74549
## Job051600 TO 1760: Physical Scientists 0.13396
## Job051800 TO 1860: Social Scientists and Related Workers 0.55221
## Job051900 TO 1960: Life, Physical and Social Science Technicians 0.75762
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 0.55171
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 0.54960
## Job052200 TO 2340: Teachers 0.28614
## Job052400 TO 2550: Education, Training and Library Workers 0.46870
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 0.01668
## Job052800 TO 2960: Media and Communications Workers 0.50618
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 0.06104
## Job053300 TO 3650: Health Care Technical and Support Occupations 0.54339
## Job053700 TO 3950: Protective Service Occupations 0.00688
## Job054000 TO 4160: Food Preparation and Serving Related Occupations 0.52483
## Job054200 TO 4250: Cleaning and Building Service Occupations 0.49666
## Job054300 TO 4430: Entertainment Attendants and Related Workers 0.08847
## Job054500 TO 4650: Personal Care and Service Workers 0.29456
## Job054700 TO 4960: Sales and Related Workers 0.18032
## Job05500 TO 950: Management Related Occupations 0.01472
## Job055000 TO 5930: Office and Administrative Support Workers 0.11566
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations 0.85489
## Job056200 TO 6940: Construction Trade and Extraction Workers 0.97771
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 0.57738
## Job057700 TO 7750: Production and Operating Workers 0.54521
## Job057800 TO 7850: Food Preparation Occupations 0.83131
## Job057900 TO 8960: Setters, Operators and Tenders 0.48162
## Job059000 TO 9750: Transportation and Material Moving Workers 0.57978
## Job059990: Uncodeable 0.94664
## Weight05 0.93356
## HeightFeet05 0.94946
## HeightInch05 0.83232
## Imagazine1 0.86487
## Inewspaper1 0.03950
## Ilibrary1 0.18075
## MotherEd 0.32191
## FatherEd 0.98527
## FamilyIncome78 0.67833
## totalHeightMeters5 NA
## BMI05 0.71388
##
## (Intercept) ***
## unlist(PC1ASVAB) ***
## gendermale .
## Education05 ***
## Income87 ***
## Job0510 TO 430: Executive, Administrative and Managerial Occupations *
## Job051000 TO 1240: Mathematical and Computer Scientists .
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians
## Job051600 TO 1760: Physical Scientists
## Job051800 TO 1860: Social Scientists and Related Workers
## Job051900 TO 1960: Life, Physical and Social Science Technicians
## Job052000 TO 2060: Counselors, Sociala and Religious Workers
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers
## Job052200 TO 2340: Teachers
## Job052400 TO 2550: Education, Training and Library Workers
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers *
## Job052800 TO 2960: Media and Communications Workers
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners .
## Job053300 TO 3650: Health Care Technical and Support Occupations
## Job053700 TO 3950: Protective Service Occupations **
## Job054000 TO 4160: Food Preparation and Serving Related Occupations
## Job054200 TO 4250: Cleaning and Building Service Occupations
## Job054300 TO 4430: Entertainment Attendants and Related Workers .
## Job054500 TO 4650: Personal Care and Service Workers
## Job054700 TO 4960: Sales and Related Workers
## Job05500 TO 950: Management Related Occupations *
## Job055000 TO 5930: Office and Administrative Support Workers
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations
## Job056200 TO 6940: Construction Trade and Extraction Workers
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers
## Job057700 TO 7750: Production and Operating Workers
## Job057800 TO 7850: Food Preparation Occupations
## Job057900 TO 8960: Setters, Operators and Tenders
## Job059000 TO 9750: Transportation and Material Moving Workers
## Job059990: Uncodeable
## Weight05
## HeightFeet05
## HeightInch05
## Imagazine1
## Inewspaper1 *
## Ilibrary1
## MotherEd
## FatherEd
## FamilyIncome78
## totalHeightMeters5
## BMI05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.19 on 2386 degrees of freedom
## Multiple R-squared: 0.143, Adjusted R-squared: 0.127
## F-statistic: 9.07 on 44 and 2386 DF, p-value: <2e-16
anova(fit.all)

## Analysis of Variance Table
##
## Response: unlist(PC1)
## Df Sum Sq Mean Sq F value Pr(>F)
## unlist(PC1ASVAB) 1 353 353 250.50 < 2e-16 ***
## gender 1 6 6 4.11 0.043 *
## Education05 1 49 49 34.54 4.8e-09 ***
## Income87 1 40 40 28.03 1.3e-07 ***
## Job05 30 98 3 2.31 7.3e-05 ***
## Weight05 1 1 1 0.79 0.373
## HeightFeet05 1 1 1 0.56 0.454
## HeightInch05 1 0 0 0.02 0.883
## Imagazine 1 0 0 0.35 0.552
## Inewspaper 1 10 10 6.84 0.009 **
## Ilibrary 1 3 3 2.21 0.137
## MotherEd 1 2 2 1.40 0.237
## FatherEd 1 0 0 0.00 0.973
## FamilyIncome78 1 0 0 0.17 0.680
## BMI05 1 0 0 0.13 0.714
## Residuals 2386 3367 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(1,2))
plot(fit.all, 1)
plot(fit.all, 2)

## Warning: not plotting observations with leverage one:
## 704
The Cancer Genome Atlas (TCGA), a landmark cancer genomics program by the National Cancer Institute (NCI), molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. The genome data is open to the public through the Genomic Data Commons Data Portal (GDC).
In this study, we focus on 4 sub-types of breast cancer (BRCA): basal-like (basal), Luminal A-like (lumA), Luminal B-like (lumB), and HER2-enriched. The sub-typing is based on PAM50, a clinical-grade luminal-basal classifier.
We will try to use the mRNA expression data alone, without the labels, to classify the 4 sub-types. Classification without labels, or prediction without outcomes, is called unsupervised learning. We will use K-means and spectrum clustering to cluster the mRNA data and see whether the sub-types can be separated through mRNA data alone.
We first read the data using data.table::fread() which
is a faster way to read in big data than read.csv().
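As a quick illustration of the `fread()` interface (using a throwaway temporary file rather than the actual BRCA data):

```r
library(data.table)

# Write a tiny CSV to a temporary file, then read it back with fread()
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(gene1 = c(3, 0, 7), gene2 = c(1, 2, 0)),
          tmp, row.names = FALSE)

dt <- fread(tmp)   # returns a data.table (which is also a data.frame)
dim(dt)            # 3 rows, 2 columns
```

`fread()` auto-detects the separator and column types and reads large files in parallel, which is why it is preferred over `read.csv()` for the full expression matrix.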
Summary and transformation
table(brca_subtype)

## brca_subtype
## Basal Her2 LumA LumB
## 208 91 628 233
b) Randomly pick 5 genes and plot the histogram by each sub-type.
num_gene <- ncol(brca)
# randomly select 5 genes
set.seed(10)
sample_idx <- sample(num_gene, 5)
# plot count number histogram for each gene
brca %>%
select(all_of(sample_idx)) %>% # select column by index
pivot_longer(cols = everything()) %>% # reshape to long format for facet_wrap()
ggplot(aes(x = value, y = ..density..)) +
geom_histogram(aes(fill = name)) +
facet_wrap(~name, scales = "free") +
theme_bw() +
theme(legend.position = "none")

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
c) Remove genes with zero counts and no variability. Then apply a logarithmic transform.
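A sketch of this filtering step on a made-up count matrix (the real analysis applies the same idea to the full brca matrix; the +1 offset is one common choice for avoiding log(0)):

```r
# Toy count matrix: one all-zero gene, one constant gene, one informative gene
counts <- data.frame(geneA = c(0, 0, 0),    # zero counts: dropped
                     geneB = c(5, 5, 5),    # no variability: dropped
                     geneC = c(2, 8, 31))   # kept

keep <- sapply(counts, function(g) sum(g) > 0 && var(g) > 0)
filtered <- counts[, keep, drop = FALSE]

logged <- log2(filtered + 1)   # logarithmic transform with a +1 offset
names(filtered)                # "geneC"
```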
Apply kmeans with 4 clusters and compare the real sub-types brca_subtype with the cluster labels.

brca_sub_kmeans <- kmeans(x = brca_sub, 4)
#saveRDS(brca_sub_kmeans, "brca_kmeans.RDS")
#brca_sub_kmeans <- readRDS("output/brca_kmeans.RDS")
table(brca_subtype, brca_sub_kmeans$cluster)

##
## brca_subtype 1 2 3 4
## Basal 1 17 3 187
## Her2 39 9 26 17
## LumA 392 82 154 0
## LumB 98 22 111 2
Spectrum clustering: to scale or not to scale?
For large matrices, irlba::irlba() computes a truncated PCA much faster than prcomp().

pca_ret <- prcomp(brca_sub, center = T, scale. = T)
pve <- summary(pca_ret)$importance[2, 1:10]
plot(pve, type="b", pch = 19, frame = FALSE)
According to the elbow rule, we should use 4 principal components as by
that point the overwhelming majority of the variance is accounted
for.
b) Plot PC1 vs PC2 of the centered and scaled data and PC1 vs PC2 of the centered but unscaled data side by side. Should we scale or not scale for the clustering process? Why? (Hint: to put plots side by side, use `gridExtra::grid.arrange()` or `ggpubr::ggarrange()` or `egg::ggarrange()` for ggplots; use `fig.show="hold"` as a chunk option for base plots)
#Centered & Scaled
library(gridExtra)

##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
p <- data.table(x = pca_ret$x[,1],
y = pca_ret$x[,2]) %>%
ggplot() +
geom_point(aes(x = x, y = y)) +
theme_bw() +
xlab("PC1") +
ylab("PC2") +
ggtitle("Centered & Scaled")
p

pca_ret_centered_unscaled <- prcomp(brca_sub, center = T, scale. = F)
p2 <- data.table(x = pca_ret_centered_unscaled$x[,1],
y = pca_ret_centered_unscaled$x[,2]) %>%
ggplot() +
geom_point(aes(x = x, y = y)) +
theme_bw() +
xlab("PC1") +
ylab("PC2") +
ggtitle("Centered but Unscaled")
p2

grid.arrange(p, p2, ncol = 2)

No, we should not scale here. All features are mRNA counts measured on a common scale, and the log transform already handles the skewness and asymmetries in the data, so scaling is not necessary.
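For intuition, a toy illustration of what scaling does to PCA when features sit on very different scales (synthetic data, not the BRCA matrix):

```r
set.seed(1)
# One high-variance and one low-variance feature, uncorrelated
toy <- cbind(big = rnorm(100, sd = 100), small = rnorm(100, sd = 1))

pve_unscaled <- summary(prcomp(toy, center = TRUE, scale. = FALSE))$importance[2, 1]
pve_scaled   <- summary(prcomp(toy, center = TRUE, scale. = TRUE))$importance[2, 1]

# Unscaled: PC1 is essentially the high-variance feature (PVE near 1);
# scaled: the two features contribute comparably (PVE near 0.5)
c(pve_unscaled, pve_scaled)
```

Because our genes are already on one common (log-count) scale, the unscaled behavior — letting the genuinely more variable genes dominate — is what we want for clustering.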
Spectrum clustering: center but do not scale the data
#kmean_ret <- kmeans(x = pca_ret_centered_unscaled$x[, 1:4], 4)
wss <- function(df, k) {
kmeans(df, k, nstart = 10)$tot.withinss
}
k.values <- 1:10
# extract wss for 1:10 clusters
wss_values <- map_dbl(k.values, function(k) wss(pca_ret_centered_unscaled$x[, 1:4], k))
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")

b) Choose an optimal cluster number and apply kmeans. Compare the real sub-type and the clustering label as follows: Plot a scatter plot of PC1 vs PC2. Use point color to indicate the true cancer type and point shape to indicate the clustering label. Plot the kmeans centroids with black dots. Summarize how good the clustering results are compared to the real sub-types.
# kmeans with 4 clusters on the first 4 PCs
kmean_ret <- kmeans(x = pca_ret_centered_unscaled$x[, 1:4], 4)
p <- data.table(x = pca_ret_centered_unscaled$x[,1],
y = pca_ret_centered_unscaled$x[,2],
col = as.factor(brca_subtype),
cl = as.factor(kmean_ret$cluster)) %>%
ggplot() +
geom_point(aes(x = x, y = y, col = col, shape = cl)) +
geom_point(data = as.data.frame(kmean_ret$centers), aes(x = PC1, y = PC2), col = "black", shape = 21, size = 5) +
scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"), # match factor-level order
values = scales::hue_pal()(4)) +
scale_shape_manual(labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"),
values = c(4, 16,17,18)) +
theme_bw() +
labs(color = "Breast Cancer Sub-type", shape = "Cluster") +
xlab("PC1") +
ylab("PC2")
p

table(brca_subtype, kmean_ret$cluster)

##
## brca_subtype 1 2 3 4
## Basal 17 1 3 187
## Her2 9 27 27 28
## LumA 91 376 161 0
## LumB 22 99 110 2
The clustering is generally able to discern the 4 sub-types, but it is not entirely accurate: there are blurred regions with significant overlap where points fall into the wrong cluster.
c) Compare the clustering result from applying kmeans to the original data and the clustering result from applying kmeans to 4 PCs. Does PCA help in kmeans clustering? What might be the reasons if PCA helps?
Based on the discrepancy tables and the plot, the two clusterings agree quite well with each other. This means we can use the 4 principal components rather than all of the dimensions in the original dataset, which makes the kmeans procedure more efficient and easier to interpret, and can yield better results by discarding noisy dimensions.
d) Now we have a patient x with breast cancer but with unknown sub-type, along with this patient's mRNA sequencing data. Project patient x into the space of PC1 and PC2. (Hint: remember we removed genes with no counts or no variability, took the log, and centered.) Plot this patient in the plot in iv) with a black dot. Calculate the Euclidean distance between this patient and each cluster centroid. Can you tell which sub-type this patient might have?
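Before applying it to the real data, the projection-and-distance computation can be sketched on a small synthetic stand-in (all names here are hypothetical; the key point is centering the new observation with the training means):

```r
set.seed(1)
train <- matrix(rnorm(200), nrow = 20)   # 20 "patients" x 10 "genes"
pca   <- prcomp(train, center = TRUE, scale. = FALSE)
km    <- kmeans(pca$x[, 1:2], centers = 3, nstart = 10)

new_x <- matrix(rnorm(10), nrow = 1)     # one new "patient"
# Center with the TRAINING means (pca$center), then multiply by the loadings;
# this matches predict(pca, new_x)[, 1:2]
new_scores <- (new_x - pca$center) %*% pca$rotation[, 1:2]

# Euclidean distance from the new point to each cluster centroid
dists <- apply(km$centers, 1, function(ctr) sqrt(sum((new_scores - ctr)^2)))
which.min(dists)   # the nearest cluster suggests the sub-type
```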
x_patient <- fread("data/brca_x_patient.csv")
x_patient
x_patient <- log2(as.matrix(x_patient + 1e-10))
# Center with the training-set gene means (stored in $center by prcomp),
# not the patient's own row mean, then project onto the loadings:
x_scores <- (x_patient - pca_ret_centered_unscaled$center) %*%
pca_ret_centered_unscaled$rotation[, 1:2]

This exercise is designed to help you understand the linear model using simulations. In this exercise, we will generate \((x_i, y_i)\) pairs so that all linear model assumptions are met.
Presume that \(\mathbf{x}\) and \(\mathbf{y}\) are linearly related with a normal error \(\boldsymbol{\varepsilon}\) , such that \(\mathbf{y} = 1 + 1.2\mathbf{x} + \boldsymbol{\varepsilon}\). The standard deviation of the error \(\varepsilon_i\) is \(\sigma = 2\).
We can create a sample input vector (\(n = 40\)) for \(\mathbf{x}\) with the following code:
# Generates a vector of size 40 with equally spaced values between 0 and 1, inclusive
x <- seq(0, 1, length = 40)
set.seed(1)
y <- 1+1.2*x+rnorm(40,0,sd=2)

Create a corresponding output vector for \(\mathbf{y}\) according to the equation
given above. Use set.seed(1). Then, create a scatterplot
with \((x_i, y_i)\) pairs. Base R
plotting is acceptable, but if you can, please attempt to use
ggplot2 to create the plot. Make sure to have clear labels
and sensible titles on your plots.
Estimate the coefficients using the lm() function. What are the true values of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\)? Do the estimates look to be good?

From the lm() function, we get \(\hat{\beta}_0 = 1.331\) and \(\hat{\beta}_1 = 0.906\), versus the true values of 1 and 1.2, so the estimates are not perfect, which is a result of the normal error with SD of 2.
x <- seq(0, 1, length = 40)
set.seed(1)
y <- 1+1.2*x+rnorm(40,0,sd=2)
pairs <- data.frame(x = x, y = y)
plot(x,y,
main="sample output vector for y according to y = 1 + 1.2x + normal error with sd=2")

fit1 <- lm(y ~ x, data=pairs)
fit1

##
## Call:
## lm(formula = y ~ x, data = pairs)
##
## Coefficients:
## (Intercept) x
## 1.331 0.906
The RSE is 1.79, which is fairly close to the true \(\sigma = 2\) (10.3% off).
rse <- summary(fit1)$sigma
rse

## [1] 1.79

(rse-2)/2

## [1] -0.103
summary(fit1)

##
## Call:
## lm(formula = y ~ x, data = pairs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.662 -0.880 0.014 1.247 2.882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.331 0.557 2.39 0.022 *
## x 0.906 0.959 0.95 0.350
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.79 on 38 degrees of freedom
## Multiple R-squared: 0.023, Adjusted R-squared: -0.00272
## F-statistic: 0.894 on 1 and 38 DF, p-value: 0.35
The 95% CI is the estimate \(\pm\ t^* \times SE\), where \(t^* =\) qt(0.975, 38) \(\approx 2.02\) (approximately 1.96 for large samples). Using 1.96, it is 0.906 +/- 1.879, i.e. [-0.973, 2.785]. Yes, this interval captures the true value of 1.2.
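The same interval can be obtained directly with confint(), which uses the exact t quantile (38 df) rather than 1.96; the data are regenerated here so the snippet stands alone:

```r
# Reconstruct fit1 as fitted above
x <- seq(0, 1, length = 40)
set.seed(1)
y <- 1 + 1.2 * x + rnorm(40, 0, sd = 2)
fit1 <- lm(y ~ x)

confint(fit1, "x", level = 0.95)  # exact t-based 95% CI for the slope
```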
plot(pairs$x, pairs$y,
pch = 16,
xlab = "X",
ylab = "Y",
main = "Overlay of LS estimations and mean function")
abline(fit1, col="red", lwd=4) # many other ways.
abline(h=mean(pairs$y), lwd=5, col="blue") # add a horizontal line, y=mean(y)

plot(fit1)

See part above
The first assumption is linearity. The residuals seem to be relatively symmetric about the horizontal line at 0, so this assumption is satisfied. The second assumption is homoscedasticity, which also seems to be satisfied, as the residuals are evenly distributed across the band. Finally, the residuals mostly follow the normal distribution according to the QQ plot, although there is some noticeable deviation at the lower tail.
This part aims to help you understand the notion of sampling statistics and confidence intervals. Let’s concentrate on estimating the slope only.
Generate 100 samples of size \(n = 40\), and estimate the slope coefficient from each sample. We include some sample code below, which should guide you in setting up the simulation. Note: this code is easier to follow but suboptimal; see the appendix for a more optimal R-like way to run this simulation.
# Initializing variables. Note b1, upper_ci, lower_ci are vectors
x <- seq(0, 1, length = 40)
n_sim <- 100 # number of simulations
b1 <- 0 # n_sim many LS estimates of beta_1 (=1.2). Initialize to 0 for now
upper_ci <- 0 # upper bound for beta_1. Initialize to 0 for now.
lower_ci <- 0 # lower bound for beta_1. Initialize to 0 for now.
t_star <- qt(0.975, 38) # Food for thought: why 38 instead of 40? What is t_star?
# Perform the simulation
for (i in 1:n_sim){
y <- 1 + 1.2 * x + rnorm(40, sd = 2)
lse <- lm(y ~ x)
lse_output <- summary(lse)$coefficients
se <- lse_output[2, 2]
b1[i] <- lse_output[2, 1]
upper_ci[i] <- b1[i] + t_star * se
lower_ci[i] <- b1[i] - t_star * se
}
results <- as.data.frame(cbind(se, b1, upper_ci, lower_ci))
results
# remove unnecessary variables from our workspace
rm(se, b1, upper_ci, lower_ci, x, n_sim, t_star, lse, lse_output)

Summarize the LS estimates of \(\beta_1\) (stored in results$b1). Does the sampling distribution agree with theory?

# Initializing variables. Note b1, upper_ci, lower_ci are vectors
x <- seq(0, 1, length = 40)
n_sim <- 100 # number of simulations
b1 <- 0 # n_sim many LS estimates of beta_1 (=1.2). Initialize to 0 for now
upper_ci <- 0 # upper bound for beta_1. Initialize to 0 for now.
lower_ci <- 0 # lower bound for beta_1. Initialize to 0 for now.
t_star <- qt(0.975, 38) # Food for thought: why 38 instead of 40? What is t_star?
# Perform the simulation
for (i in 1:n_sim){
y <- 1 + 1.2 * x + rnorm(40, sd = 2)
lse <- lm(y ~ x)
lse_output <- summary(lse)$coefficients
se <- lse_output[2, 2]
b1[i] <- lse_output[2, 1]
upper_ci[i] <- b1[i] + t_star * se
lower_ci[i] <- b1[i] - t_star * se
}
results <- as.data.frame(cbind(se, b1, upper_ci, lower_ci))
## inserting from above^
summary(results$b1)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.36 0.23 1.05 1.06 1.87 3.97
sample_sd <- sd(results$b1)/sqrt(length(results$b1))
ci_lower = mean(results$b1) - 1.96*sample_sd
ci_upper = mean(results$b1) + 1.96*sample_sd
paste("Lower CI interval: ",ci_lower)

## [1] "Lower CI interval: 0.837966630583203"

paste("Higher CI interval: ",ci_upper)

## [1] "Higher CI interval: 1.27250590645873"
We see that the mean of the sampling distribution is 1.06, while the theoretical value is 1.2. The estimate is not perfect, but the true value is captured within a 95% confidence interval of the sampling mean, so we can say that the sampling distribution agrees with theory.
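A further check of that agreement: the theoretical standard deviation of \(\hat{\beta}_1\) is \(\sigma / \sqrt{\sum_i (x_i - \bar{x})^2}\), which the empirical sd(results$b1) from the simulation should approximate:

```r
# Theoretical sampling SD of the slope estimate for this design
x <- seq(0, 1, length = 40)
se_theory <- 2 / sqrt(sum((x - mean(x))^2))
se_theory   # about 1.07; compare with sd(results$b1)
```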
sample_number = c(1:100)
p1 <- ggplot(results, aes(sample_number, b1)) + geom_point() +
geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci))
p1 + ggtitle("Confidence Interval for Each Sample")

total_including <- sum(results$lower_ci<=1.2 & results$upper_ci>=1.2)
paste("Number of Confidence Intervals including true value: ",total_including)

## [1] "Number of Confidence Intervals including true value: 96"